Dyr og Data

Data wrangling — joining tables

Gavin Simpson

Aarhus University

Mona Larsen

Aarhus University

2024-09-04

Learning objectives

In this section of the data wrangling topic you will

  • Learn how we join tables together

  • Be familiar with mutating and filtering joins

  • Know the differences between various types of join

  • Know how to join data frames in R using dplyr

Relational Data

New York Flights

nycflights13 contains five tables

  1. flights
  2. airlines
  3. airports
  4. planes
  5. weather

New York Flights

flights
# A tibble: 336,776 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      542            540         2      923            850
 4  2013     1     1      544            545        -1     1004           1022
 5  2013     1     1      554            600        -6      812            837
 6  2013     1     1      554            558        -4      740            728
 7  2013     1     1      555            600        -5      913            854
 8  2013     1     1      557            600        -3      709            723
 9  2013     1     1      557            600        -3      838            846
10  2013     1     1      558            600        -2      753            745
# ℹ 336,766 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

New York Flights

airlines
# A tibble: 16 × 2
   carrier name                       
   <chr>   <chr>                      
 1 9E      Endeavor Air Inc.          
 2 AA      American Airlines Inc.     
 3 AS      Alaska Airlines Inc.       
 4 B6      JetBlue Airways            
 5 DL      Delta Air Lines Inc.       
 6 EV      ExpressJet Airlines Inc.   
 7 F9      Frontier Airlines Inc.     
 8 FL      AirTran Airways Corporation
 9 HA      Hawaiian Airlines Inc.     
10 MQ      Envoy Air                  
11 OO      SkyWest Airlines Inc.      
12 UA      United Air Lines Inc.      
13 US      US Airways Inc.            
14 VX      Virgin America             
15 WN      Southwest Airlines Co.     
16 YV      Mesa Airlines Inc.         

New York Flights

airports
# A tibble: 1,458 × 8
   faa   name                             lat    lon   alt    tz dst   tzone    
   <chr> <chr>                          <dbl>  <dbl> <dbl> <dbl> <chr> <chr>    
 1 04G   Lansdowne Airport               41.1  -80.6  1044    -5 A     America/…
 2 06A   Moton Field Municipal Airport   32.5  -85.7   264    -6 A     America/…
 3 06C   Schaumburg Regional             42.0  -88.1   801    -6 A     America/…
 4 06N   Randall Airport                 41.4  -74.4   523    -5 A     America/…
 5 09J   Jekyll Island Airport           31.1  -81.4    11    -5 A     America/…
 6 0A9   Elizabethton Municipal Airport  36.4  -82.2  1593    -5 A     America/…
 7 0G6   Williams County Airport         41.5  -84.5   730    -5 A     America/…
 8 0G7   Finger Lakes Regional Airport   42.9  -76.8   492    -5 A     America/…
 9 0P2   Shoestring Aviation Airfield    39.8  -76.6  1000    -5 U     America/…
10 0S9   Jefferson County Intl           48.1 -123.    108    -8 A     America/…
# ℹ 1,448 more rows

New York Flights

planes
# A tibble: 3,322 × 9
   tailnum  year type              manufacturer model engines seats speed engine
   <chr>   <int> <chr>             <chr>        <chr>   <int> <int> <int> <chr> 
 1 N10156   2004 Fixed wing multi… EMBRAER      EMB-…       2    55    NA Turbo…
 2 N102UW   1998 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
 3 N103US   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
 4 N104UW   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
 5 N10575   2002 Fixed wing multi… EMBRAER      EMB-…       2    55    NA Turbo…
 6 N105UW   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
 7 N107US   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
 8 N108UW   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
 9 N109UW   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
10 N110UW   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
# ℹ 3,312 more rows

New York Flights

weather
# A tibble: 26,115 × 15
   origin  year month   day  hour  temp  dewp humid wind_dir wind_speed
   <chr>  <int> <int> <int> <int> <dbl> <dbl> <dbl>    <dbl>      <dbl>
 1 EWR     2013     1     1     1  39.0  26.1  59.4      270      10.4 
 2 EWR     2013     1     1     2  39.0  27.0  61.6      250       8.06
 3 EWR     2013     1     1     3  39.0  28.0  64.4      240      11.5 
 4 EWR     2013     1     1     4  39.9  28.0  62.2      250      12.7 
 5 EWR     2013     1     1     5  39.0  28.0  64.4      260      12.7 
 6 EWR     2013     1     1     6  37.9  28.0  67.2      240      11.5 
 7 EWR     2013     1     1     7  39.0  28.0  64.4      240      15.0 
 8 EWR     2013     1     1     8  39.9  28.0  62.2      250      10.4 
 9 EWR     2013     1     1     9  39.9  28.0  62.2      260      15.0 
10 EWR     2013     1     1    10  41    28.0  59.6      260      13.8 
# ℹ 26,105 more rows
# ℹ 5 more variables: wind_gust <dbl>, precip <dbl>, pressure <dbl>,
#   visib <dbl>, time_hour <dttm>

New York Flights

Source: Wickham & Grolemund R for Data Science

Keys

Variables that connect each pair of tables are called keys

Each plane is uniquely identified by the tailnum

Two types of key

  1. Primary key — uniquely identifies an observation in it’s own table planes$tailnum
  2. Foreign key — uniquely identifies an observation in another table flights$tailnum

Surrogate keys are keys that are created when a table doesn’t have one or more variables that uniquely identify an observation

mutate(tbl, id = row_number())

Elephants

Animal science Data are regularly stored this way

A study on behaviour in elephants used three tables

  1. qualitaive behavioural analysis
  2. daytime behaviour observations
  3. nighttime behaviour observations

Elephants

glimpse(elephant_day)
Rows: 84
Columns: 18
$ elephant                  <chr> "Elephant1F", "Elephant1F", "Elephant1F", "E…
$ observer                  <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ observation               <chr> "Obs1", "Obs2", "Obs3", "Obs4", "Obs5", "Obs…
$ stereotypy                <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ wallowing                 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ feeding                   <dbl> 2, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 4, 1, 2,…
$ foraging                  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0,…
$ feedforage                <dbl> 2, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 6, 1, 2,…
$ comfort                   <dbl> 0, 0, 0, 0, 1, 0, 1, 0, 3, 0, 1, 0, 0, 0, 0,…
$ comfortwallow             <dbl> 0, 0, 0, 0, 1, 0, 1, 0, 3, 0, 1, 0, 0, 0, 0,…
$ environmental_interaction <dbl> 1, 0, 0, 0, 0, 1, 0, 0, 2, 2, 0, 0, 2, 0, 1,…
$ affiliative               <dbl> 1, 2, 1, 1, 1, 1, 1, 2, 1, 5, 3, 1, 0, 0, 0,…
$ agonistic                 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ anticipating              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ locomotion                <dbl> 2, 0, 1, 1, 0, 1, 0, 0, 1, 3, 2, 0, 2, 0, 1,…
$ sleeprest                 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0,…
$ stereoantic               <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ affilagonostic            <dbl> 1, 2, 1, 1, 1, 1, 1, 2, 1, 5, 3, 1, 0, 0, 0,…
glimpse(elephant_qba)
Rows: 72
Columns: 15
$ elephant      <chr> "Elephant1F", "Elephant1F", "Elephant1F", "Elephant1F", …
$ observer      <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ observation   <chr> "Obs1", "Obs2", "Obs3", "Obs4", "Obs5", "Obs6", "Obs7", …
$ content       <dbl> 8.6, 7.1, 7.0, 6.4, 8.0, 9.1, 8.6, 8.0, 8.4, 7.9, 9.5, 7…
$ relaxed       <dbl> 8.8, 7.7, 8.0, 7.0, 9.0, 9.5, 9.0, 9.0, 9.7, 8.0, 8.9, 6…
$ uncomfartable <dbl> 0.9, 0.4, 0.8, 1.3, 0.9, 0.4, 0.9, 0.9, 0.3, 0.5, 0.5, 0…
$ agitated      <dbl> 0.2, 1.0, 0.8, 0.7, 0.3, 0.3, 0.2, 0.1, 0.5, 1.0, 0.3, 0…
$ tense         <dbl> 1.5, 3.0, 1.2, 3.2, 0.3, 0.7, 0.2, 1.0, 0.5, 0.3, 0.9, 0…
$ frustrated    <dbl> 0.7, 2.0, 0.2, 0.4, 0.4, 0.3, 0.3, 0.1, 0.2, 0.4, 0.4, 0…
$ wary          <dbl> 0.9, 1.4, 2.9, 2.0, 0.7, 1.2, 0.8, 0.9, 1.0, 2.0, 0.8, 4…
$ playful       <dbl> 0.2, 3.0, 1.5, 0.2, 0.4, 4.1, 1.2, 1.5, 5.7, 0.3, 5.0, 0…
$ sociable      <dbl> 0.2, 0.1, 0.2, 5.8, 0.2, 0.2, 6.5, 0.1, 9.0, 0.1, 0.1, 0…
$ lively        <dbl> 3.7, 5.0, 2.1, 4.0, 1.0, 4.0, 3.0, 1.0, 5.6, 3.8, 5.0, 5…
$ lethargic     <dbl> 0.8, 0.4, 1.1, 1.5, 2.0, 0.8, 0.3, 2.0, 0.1, 0.6, 0.3, 0…
$ observations  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …

Can you identify the keys in these tables?

Elephants

We don’t know the names of the elephants here, nor do we have metadata about each animal

But we can make some up:

elephant_meta <- tribble(
  ~elephant, ~name, ~sex, ~age,
  "Elephant1F", "Winifred", "female", 45,
  "Elephant2F", "Ellie", "female", 23,
  "Elephant3M", "Dumbo", "male", 57
)

elephant_observer <- tribble(
  ~observer, ~given, ~surname, ~institution,
  1, "Jan", "Lauridesen", "Aalborg Zoo",
  2, "Line", "Larsen", "Zoologisk Have København",
  3, "Charlotte", "Davidson", "Odense Zoo",
  4, "Else", "Jacobsen", "Odense Zoo",
)

Which column(s) are keys in elephant_meta?

Which column(s) are keys in elephant_observer?

Joins

Setting up

To make it clearer what is going on, we’ll simplify the two real elephant data frames to fewer variables

elephant_day2 <- elephant_day |>
  select(elephant, observer, foraging, feeding)

elephant_qba2 <- elephant_qba |>
  select(elephant:relaxed)

Mutating joins

A mutating join combine variables from two data frames:

  1. it first matches observations by their keys, then
  2. copies across variables from one data frame to the other

There are four types of mutating join

But the left join is the one we tend to use most often

Left join

A left join is performed using the left_join() function

We use it like this

tbl1 |> left_join(tbl2, by = join_by())

where tbl1 and tbl2 are the two data frames we want to join and join_by() determines how we will join the data frames

Left join

elephant_day2 |>
  left_join(
    elephant_observer,
    by = join_by(observer == observer)
  )
# A tibble: 84 × 7
   elephant   observer foraging feeding given surname    institution
   <chr>         <dbl>    <dbl>   <dbl> <chr> <chr>      <chr>      
 1 Elephant1F        1        0       2 Jan   Lauridesen Aalborg Zoo
 2 Elephant1F        1        0       1 Jan   Lauridesen Aalborg Zoo
 3 Elephant1F        1        0       1 Jan   Lauridesen Aalborg Zoo
 4 Elephant1F        1        0       1 Jan   Lauridesen Aalborg Zoo
 5 Elephant1F        1        0       1 Jan   Lauridesen Aalborg Zoo
 6 Elephant1F        1        0       1 Jan   Lauridesen Aalborg Zoo
 7 Elephant1F        1        0       1 Jan   Lauridesen Aalborg Zoo
 8 Elephant1F        1        0       1 Jan   Lauridesen Aalborg Zoo
 9 Elephant2F        1        0       2 Jan   Lauridesen Aalborg Zoo
10 Elephant2F        1        0       1 Jan   Lauridesen Aalborg Zoo
# ℹ 74 more rows

By default, all matching columns will be used as the join key, but best to specify which columns you want to use with join_by()

Left join

elephant_day2 |>
  left_join(
    elephant_meta,
    by = join_by(elephant)
  )
# A tibble: 84 × 7
   elephant   observer foraging feeding name     sex      age
   <chr>         <dbl>    <dbl>   <dbl> <chr>    <chr>  <dbl>
 1 Elephant1F        1        0       2 Winifred female    45
 2 Elephant1F        1        0       1 Winifred female    45
 3 Elephant1F        1        0       1 Winifred female    45
 4 Elephant1F        1        0       1 Winifred female    45
 5 Elephant1F        1        0       1 Winifred female    45
 6 Elephant1F        1        0       1 Winifred female    45
 7 Elephant1F        1        0       1 Winifred female    45
 8 Elephant1F        1        0       1 Winifred female    45
 9 Elephant2F        1        0       2 Ellie    female    23
10 Elephant2F        1        0       1 Ellie    female    23
# ℹ 74 more rows

We can just name the variable to match rows on if it is the same in both data sets

Mutating joins

The four mutating joins are

  1. left join — left_join()
  2. right join — right_join()
  3. full join — full_join()
  4. inner join — inner_join()

Mutating joins

tbl1 |> left_join(tbl2, by = join_by())
  • A left join keeps all the rows in tbl1,
  • A right join keeps all rows in tbl2,
  • A full join keeps all rows in either tbl1 or tbl2, and
  • An inner join only keeps rows that occur in both tbl1 and tbl2

Filtering joins

Where mutating joins modify the columns of the focal data frame, filtering joins filter rows of the focal data frame

Two types of filtering join

  1. a semi join — semi_join()
  2. a anti join — anti_join()

Filtering joins

We have four observers in elephant_observer

elephant_observer
# A tibble: 4 × 4
  observer given     surname    institution             
     <dbl> <chr>     <chr>      <chr>                   
1        1 Jan       Lauridesen Aalborg Zoo             
2        2 Line      Larsen     Zoologisk Have København
3        3 Charlotte Davidson   Odense Zoo              
4        4 Else      Jacobsen   Odense Zoo              

but we can filter the observers to keep only those observers that observed the elephants in this study

elephant_observer |>
  semi_join(elephant_day, join_by(observer))
# A tibble: 3 × 4
  observer given     surname    institution             
     <dbl> <chr>     <chr>      <chr>                   
1        1 Jan       Lauridesen Aalborg Zoo             
2        2 Line      Larsen     Zoologisk Have København
3        3 Charlotte Davidson   Odense Zoo              

Filtering joins

Alternatively, we might want to know which observers do not have any observations in the data set

This is what the anti join is for

elephant_observer |>
  anti_join(elephant_day, join_by(observer))
# A tibble: 1 × 4
  observer given surname  institution
     <dbl> <chr> <chr>    <chr>      
1        4 Else  Jacobsen Odense Zoo 

How joins work

To look a little closer at how joins work we’ll use these two data sets

x <- tribble(
  ~key, ~val_x,
     1, "x1",
     2, "x2",
     3, "x3"
)
y <- tribble(
  ~key, ~val_y,
     1, "y1",
     2, "y2",
     4, "y3"
)

How joins work

That is

How joins work

The image shows all potential matches between data frames x and y

To describe a specific type of join, the matches are indicated by dots

How joins work

To describe a specific type of join, the matches are indicated by dots

An inner join returns only those rows with a match in both x and y

How joins work

A left join keeps all the rows of x

Rows in x without a match in y take a value NA in the resulting data frame

How joins work

A right join keeps all the rows of y

Rows in x without a match in y take a value NA in the resulting data frame

How joins work

A full join keeps all the rows of x and y

Rows in one table without a match in the other take a value NA in the resulting data frame

How joins work

We can describe the different mutating joins using venn diagrams

How joins work

What happens if a row in one table matches multiple rows in the other?

Consider a inner join

Outcomes for a row in x:

  • if it doesn’t match anything, it’s dropped.
  • if it matches 1 row in y, it’s preserved.
  • if it matches more than 1 row in y, it’s duplicated once for each match.

How joins work

What if multiple rows in x match multiple rows in y?

We get a many-to-many relationship

Beware!

Relationships

df1 <- tibble(key = c(1, 2, 2), val_x = c("x1", "x2", "x3"))
df2 <- tibble(key = c(1, 2, 2), val_y = c("y1", "y2", "y3"))
  • the first row in df1 only matches one row in df2, but
  • the second and third rows in df1 both match two rows in df2
df1 |> 
  inner_join(df2, join_by(key))
# A tibble: 5 × 3
    key val_x val_y
  <dbl> <chr> <chr>
1     1 x1    y1   
2     2 x2    y2   
3     2 x2    y3   
4     2 x3    y2   
5     2 x3    y3   

This can cause a combinatoric explosion (all possible combinations of matching rows!)

Relationships

df1 |> 
  inner_join(df2, join_by(key), relationship = "many-to-many")
# A tibble: 5 × 3
    key val_x val_y
  <dbl> <chr> <chr>
1     1 x1    y1   
2     2 x2    y2   
3     2 x2    y3   
4     2 x3    y2   
5     2 x3    y3   

If you want this kind of relationship, specify it via relationship

Elephants

We can join the two data frames of observations

elephant_day |>
  left_join(
    elephant_qba,
    by = join_by(elephant, observer, observation)
  )
# A tibble: 84 × 30
   elephant   observer observation stereotypy wallowing feeding foraging
   <chr>         <dbl> <chr>            <dbl>     <dbl>   <dbl>    <dbl>
 1 Elephant1F        1 Obs1                 0         0       2        0
 2 Elephant1F        1 Obs2                 0         0       1        0
 3 Elephant1F        1 Obs3                 0         0       1        0
 4 Elephant1F        1 Obs4                 0         0       1        0
 5 Elephant1F        1 Obs5                 0         0       1        0
 6 Elephant1F        1 Obs6                 0         0       1        0
 7 Elephant1F        1 Obs7                 0         0       1        0
 8 Elephant1F        1 Obs8                 0         0       1        0
 9 Elephant2F        1 Obs9                 0         0       2        0
10 Elephant2F        1 Obs10                0         0       1        0
# ℹ 74 more rows
# ℹ 23 more variables: feedforage <dbl>, comfort <dbl>, comfortwallow <dbl>,
#   environmental_interaction <dbl>, affiliative <dbl>, agonistic <dbl>,
#   anticipating <dbl>, locomotion <dbl>, sleeprest <dbl>, stereoantic <dbl>,
#   affilagonostic <dbl>, content <dbl>, relaxed <dbl>, uncomfartable <dbl>,
#   agitated <dbl>, tense <dbl>, frustrated <dbl>, wary <dbl>, playful <dbl>,
#   sociable <dbl>, lively <dbl>, lethargic <dbl>, observations <chr>

How joins work

Filtering joins filter the rows of x to only those with or without a match in y

Here a semi_join() returns two rows because those are the only rows in x that match in y

How joins work

Conversely, an anti_join() returns one row because only that row in x does not have a match in y